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Abstract 


This paper describes a new replication algorithm that is able 
to tolerate Byzantine faults. We believe that Byzantine- 
fault-tolerant algorithms will be increasingly important in 
the future because malicious attacks and software errors are 
increasingly common and can cause faulty nodes to exhibit 
arbitrary behavior. Whereas previous algorithms assumed a 
synchronous system or were too slow to be used in practice, 
the algorithm described in this paper is practical: it@WOrkS!in) 


(@Synchronous!environment® like the Internet and incorporates 
several important optimizations that improve the response time 
of previous algorithms by more than an order of magnitude. We 
implemented a Byzantine-fault-tolerant NFS service using our 
algorithm and measured its performance. The results show that 
our service is only 3% slower than a standard unreplicated NFS. 


1 Introduction 


Malicious attacks and software errors are increasingly 
common. The growing reliance of industry and gov- 
ernment on online information services makes malicious 
attacks more attractive and makes the consequences of 
successful attacks more serious. In addition, the number 
of software errors is increasing due to the growth in size 
and complexity of software. Since malicious attacks and 
software errors can cause faulty nodes to exhibit Byzan- 
tine (i.e., arbitrary) behavior, Byzantine-fault-tolerant al- 
gorithms are increasingly important. 


This paper presents a new, (racticalWalgorithm for) 
state machine replication [17, 34] that tolerates Byzantine 


rely on theGynchronylassumption\foricorrectness, which 


and replication techniques that tolerate Byzantine faults 
(starting with [19]). However, most earlier work (e.g., 
[3, 24, 10]) either concerns techniques designed to 
demonstrate theoretical feasibility that are too inefficient 
to be used in practice, or assumes synchrony, i.e., 
relies on known bounds on message delays and process 
speeds. The systems closest to ours, Rampart [30] and 
SecureRing [16], were designed to be practical, but they 


» An 
attacker may @6mipromiise the safety of a service@byD 


Our algorithm is not vulnerable to this type of 
attack because it does not rely on synchrony for 
safety. In addition, it improves the performance of 
Rampart and SecureRing by more than an order of 


magnitude as explained in Section 7. It Jony one 


(@yptosraphiyy which was cited as the major @aténey [29] 
G@haithrOughpat [22] GOtleHetk in Rampart, dsyusedyonlyp 
‘when there are faults. 

To evaluate our approach, we implemented a replica- 
tion library and used it to implement a real service: a 


G The algorithm Gffers"bothMlivenesstandisatety 


Byzantine-fault-tolerant distributed file system that sup- 


EE means thathientsjeventuallyp 


receive replies to their requests and those replies are 
(Correchaccordingtolinearizability (14, 4]. The algorithm 
(Worksiilasynchronousisystems like the Internet and it 
incorporates important @ptimizations that enable t 


There is a significant body of work on agreement 
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ports the NFS protocol. We used the Andrew bench- 
mark [15] to evaluate the performance of our system. The 
results show that our system is only 3% slower than the 
standard NFS daemon in the Digital Unix kernel during 
normal-case operation. 

Thus, the paper makes the following contributions: 


e It describes the first state-machine replication proto- 
col that correctly survives Byzantine faults in asyn- 
chronous networks. 

e It describes a number of important optimizations that 


allow the algorithm to perform well so that it can be 
used in real systems. 


e It describes the implementation of a Byzantine-fault- 
tolerant distributed file system. 


e It provides experimental results that quantify the cost 
of the replication technique. 


The remainder of the paper is organized as follows. 
We begin by describing our system model, including our 
failure assumptions. Section 3 describes the problem 
solved by the algorithm and states correctness conditions. 
The algorithm is described in Section 4 and some 
important optimizations are described in Section 5. 
Section 6 describes our replication library and how 
we used it to implement a Byzantine-fault-tolerant 
NFS. Section 7 presents the results of our experiments. 
Section 8 discusses related work. We conclude with a 
summary of what we have accomplished and a discussion 
of future research directions. 


2 System Model 


We assume an asynchronous distributed system where 
nodes are connected by a network. Thelietworkemnay, 
We use a Byzantine failure model, i.e., fauiliy@iiodesy 

, subject only to the restriction 


mentioned below. 
For this assumption to be true in the presence of malicious 
attacks, some steps need to be taken, e.g.,@@ach mode 


should run different implementations of the service code 

@mdgoperatingysystem and should have a@@ifferentiroom 
@assword andadifferentadmMinistatom It is possible 
to obtain different implementations from the same code 
base [28] and for low degrees of replication one can buy 
operating systems from different vendors. N-version 
programming, i.e., different teams of programmers 
produce different implementations, is another option for 

some Services. 

We use cryptographic techniques to prevent spoofing 
“and replays and to detect corrupted messages. Our 
"messages contain public-key signatures [33], message 

CD D 
oisinnean asni [32]. We denote a 
message m signed by node i as (m)o; and the digest of 
messager DAT We follow the common practice 
of signing a digest of a message and appending it to 


the plaintext of the message rather than signing the full 
amessage ((m),, should be interpreted in this way). © 
replicas know the others’ public keys to verify signatures. 


We allow for a very strong @@dversary that (Gan 
‘coordinate faulty nodes, delay communication, or delay 
(@orrechnGdes) in order to cause the most damage to the 
replicated service. We do assume that the @dWérsaly 
Camnotideélayicormectiodestindefinitely. We also assume 
that the adversam (and the faulty nodes it controls) 


are computationally bound so that (with very high 
probability) it is Gilable_towsubvertithetcryptographio 

eige mentioned above. For example, the adversary 
cannot produce a valid signature of a non-faulty node, 
compute the information summarized by a digest from 
the digest, or find two messages with the same digest. 
The cryptographic techniques we use are thought to have 
these properties [33, 36, 32]. 


3 Service Properties 


Our algorithm can be used to implement any deterministic 
‘replicated service with a state and some operations. The 
@perationS are not restricted to simple reads or writes of 
portions of the service state; they Gam perform arbitrary”) 
deterministic computations using the state and operation 
arguments.’ Clients issue requests to the replicated service | 
to invoke operations and block waiting for a reply. The 
replicatedG@ervicelisiimplementedibymireplicas. Clients 
and replicas are Gon=faultyatitheyitollowihetalgorithm 
in Section 4 and if no attacker can forge their signature. 
The algorithm provides both safety and liveness assum- 
ing no more than | "> | replicas are faulty. Saifetymeans 
that the replicated service satisfies linearizability [14] 


(modified to account for Byzantine-faulty clients [4]): it 


(Gperationsiatomicallyionelabaltime: Safety requires the 
bound on the number of faulty replicas because a faulty 
replica can behave arbitrarily, e.g., it can destroy its state. 


‘Safety is provided regardless of how many faulty 
amisa esei (even if they collude with 
faulty replicas): alljoperations\performedibyfaultyichients) 
are observed in a consistent way by non-faulty clients. 


In particular, if the service operations are designed to 
preserve some invariants on the service state, faulty 
clients cannot break those invariants. 

The safety property is insufficient to guard against 
faulty clients, e.g., in a file system a faulty client can 
write garbage data to some shared file. However, we 
limit the amount of damage a faulty client can do by 
providing access control: we authenticate clients and 
deny access if the client issuing a request does not have 
the right to invoke the operation. Also, services may 
provide operations to change the access permissions for 
a client. Ginee the algorithm ensures that the 
access revocation{operationslarelObserved consistently by 

‘all clients, this provides a powerful mechanism to recover 
comaa by faulty clients. 

The algorithm does not rely on synchrony to provide 
safety. Therefore, dtmustmelyjonjsynchronyitoyprovide 
liveness; otherwise it could be used to implement 
consensus in an asynchronous system, which is not 
possible [9]. We guarantee liveness, i.e., @lients 
eventually receive replies to their requests, provided at 
‘most |"5+] replicas are faulty and delay(t) does not 


` grow faster than ¢ indefinitely. Here, delay(t) is the 


. (A more precise definition 


‘message until it is received) 

can be found in [4].) This is a rathen§weakisynchrony 
assumption that is likely to be true in any real system 
provided network faults are eventually repaired, yet it 
enables us to circumvent the impossibility result in [9]. 


The @ésiliéney of our algorithmisopima rsy Prsi 
‘minimum number of replicas that allow an asynchronous 


‘system to provide the safety and liveness properties when — 
apltolireplicaslarelfaultyp (see [2] for a proof). This 


many replicas are needed because it must be possible to 
proceed after communicating with n — f replicas, since 
f replicas might be faulty and not responding. However, 
it is possible that the f replicas that did not respond are 
not faulty and, therefore, f of those that responded might 
be faulty. Even so, there must still be enough responses 
that those from non-faulty replicas outnumber those from 
faulty ones, i.e., n — 2f > f. Therefore n > 3f. 
The algorithm does not address the problem of fault- 
tolerant privacy aay replica may leakuntormationto 
GMa KE It is not feasible to offer fault-tolerant privacy 
in the general case because service operations may 
perform arbitrary computations using their arguments and 
the service state; replicas need this information in the 
clear to execute such operations efficiently. It is possible 
to use secret sharing schemes [35] to obtain privacy even 
in the presence of a threshold of malicious replicas [13] 
for the arguments and portions of the state that are opaque 
to the service operations. We plan to investigate these 
techniques in the future. 


4 The Algorithm 
Our algorithm is a form of Sate machine replicato [17, 


34]: the service is modeled as a state machine that is 
replicated across different nodes in a distributed system. 
Each state machine replica maintains the service state 

We denote the 

and identify each replica using an 

integer in {0,...,|R| — 1}. For simplicity, we assume 
where f is the maximum number of 

replicas that may be faulty; although there could be 
more than 3 f + 1 replicas, the @dditionalleplicasidegradep 
eoma (since more and bigger messages are being 

exchanged) without providing improved resiliency. 

Thelf@plicasimovelthroug a succession of configura- 
tions called views. In a view one replica is the primary 


(GaSTAUEAPViewstamped Replication [26] and Paxos [18] 


> can be primary indefinitely? 


used a similar approach to tolerate benign faults (as dis- 
cussed in Section 8.) 
The algorithm works roughly as follows: 


1. A client sends a request to invoke a service operation — 
< to the primary 

2. The primary multicasts the request to the backups — 

“3. Replicas execute the request and send a reply to the 
‘dlient 

‘4. The client waits for f + 1 replies from different 
aplicase this is the result of 


the operation. 
Like all state machine replication techniques [34], 
we impose two 
(i.e., the execution of an operation in 
a given state and with a given set of arguments must 


always produce the same result) and they must start inthe 
(AMEE Given these two requirements, the algorithm 


ensures the safety pope a Cane perme 


The remainder of this section describes a simplified 
version of the algorithm. Wel@mii— discussion of HéW? 
GOdESMTECSVERMToMfault> due to lack of space. We 
also omit details related to message retransmissions. 
Furthermore, we assuimelthatimessagelauthenticationns 
c@chievedtusing digitalsignatires rather than the more 


efficient scheme based on message authentication codes; 
Section 5 discusses this issue further. A detailed 
formalization of the algorithm using the I/O automaton 
model [21] is presented in [4]. 


4.1 The Client 


A dimo Fequest> the execution of state machine 

GPeratGHED by sending a@REQUESTHOWC)@, Message 

‘to the primary. Timestamp ¢ is used to ensure exactly- 

(ee semantics for the execution of client requests. 

Timestamps for c’s requests are totally ordered such that what 
happens 

for example, the timestamp could be the value of the jf c fakes 

client’s local clock when the request is issued. an earlier 


‘Each message sent by the replicas to the client includes 
G@HeCUREREVIEWHUMER allowing the client GTER the t? It looks 
view and hence theveirrent/primary) A client sends like a 
a request to Whatlit\believesiisiihelcinentipimay Message 
using a point-to-point message. The@primaryatomically that the 
GUNTICASTS thierequestioallitiebackups using the protocol network 
described in the next section. d roppe d. 
helene. The reply has the formai ori f 
ra what if 


where v is the current view number, t is the timestamp of 


the corresponding request ,@lisitheneplicanumibep and@ MESSAge 


is theP@8UlPof executing the requested operation. is sent to 
The client waits for f + 1 replies with valid signatures wrong 
from different replicas, and with the same ¢ and ry before primary? 


accepting theresult™ This ensures that the result is valid, 
since at most f replicas can be faulty. 

If the client does not receive replies soon enough, it 
broadcasts the request to all replicas. If the request has 
already been proce$sed, the replicas simply re-send the 
reply; replicas remember the last reply message they sent 
to each client. Otherwise iffhereplicaismortheprimary; 
it relays the request to the primary. If the primary does 
not multicast the request to the group, it will eventually 
be suspected to be faulty by enough replicas to cause a 
view change. 

In this paper we assume that the client waits for one 
request to complete before sending the next one. Bute 
can allow a client to make asynchronous requests, yet 
preserve ordering constraints on them. 


4.2 Normal-Case Operation 


The state of each replica includes the state of the 
service, a message log containing messages the replica 
has accepted, and an integer denoting the replica’s current 
View) We describe how to truncate the log in Section 4.3. 

When the primary, p, receives a client request, m, 
it starts a three-phase protocol to atomically multicast 
theyrequest\tothe replicas. The primary starts the 
protocol immediately unless the number of messages 
for which the protocol dsyingprogressyexceedsyaygiven 
maximum, In this case, it buffers the request. Buffered 
requests are niulficast/later asa) sroup)to cut down on 
message traffic and CPU overheads under heavy load; this 
optimization is similar to a group commit in transactional 
systems [11]. For simplicity, we ignore this optimization 
in the description below. 

The three phases are pre-prepare, prepare, and commit. 
The@re*preparelandiprepare phases are used totally 
order requests sent in the same view even when the 
primary, which proposes the ordering of requests, is 
faulty. The préparélandicommit phases are used tojensiire 
that requests that commit are totally ordered across views. 

In the prespreparé—phase, the Mima assigna 
sequence number, nf to the request, multicasts a pre- 
prepare message with m piggybacked to all the backups, 
andlappends the message tolitslog) The message has the 
form ((BRESPREPARE}O)7yd)eyymm), where v indicates 
the view in which the message is being sent, m is the 
client’s request message, and@iiswn’sidigest: 

Requests are not inelided@ in pre-prepare messages 
to keep them small. This is important because@pre= 
prepare messages are used as a proof that the request was 
assigned sequence number n in view v in view changes. 
Additionally, it decouples the protocol to totally order 
requests from the protocol to transmit the request to the 
replicas; allowing us to use a(transport Optimized” for 
Small)messages) for protocol messages and a transport 
optimized for large messages for large requests. 


Ay backupracceptsialpre=prepare message provided: 
e the signatures in the request and the pre-prepare 
message are correct and d is the digest for m; 
e itis in view v; 
e it has not accepted a pre-prepare message for view v 
and sequence number n containing a different digest; 
e the sequence number in the pre-prepare message is 
between a low water mark, h, and a high water mark, 
H. 
The last condition evem aay primary from 
exhausting the space of sequence numbers by selecting 
avem largeone) We discuss how H and h advance in 
Section 4.3. 

Ifbackiipwlaccepts the ((PRE-PREPARE, V, N, d)o,, M) 
message, it enters the prepare phase by multicasting a 
(PREPARE, U, n, d, 1), Message to all other replicas and 
adds both messagestoits log Otherwise, it does nothing. 

A replica (including the primary) accepts prepare 
messages and adds them to its log provided their 
signatures are correct, their view number equals the 
replica’s current view, and their sequence number is 
between h and H. 

We define the predicate prepared(myoynyi) to beltrue 
if and only if replica į has inserted in its log: the request 
m, a pre-prepare for m in view v with sequence number 
n, and 2f prepares from different backups that match 
the pré=préparé) The replicas verify whether the prepares 
match the pre-prepare by checking that they have the 
same view, sequence number, and digest. 

The pre2prépare and prepare phases of the algorithm 
guarantee that non-faulty replicas agree on a total order 
for the requests within a view. More precisely, they 
ensure the following invariant: @fiprepared(myoymyi lis 
true then prepared(m', v, n, j) is false for any non-faulty 
replica j (including i = j) and any m’ such that D(m') 4 
D(m). This is true because prepared(m,v,n,i) and 
|R| = 3f +1 imply that at least f + 1 non-faulty replicas 
have sent a pre-prepare or prepare for m in view v with 
sequence number n. ‘Thus, for prepared(m’,v,n, j) 
to be true at least one of these replicas needs to have 
sent two conflicting prepares (or pre-prepares if it is the 
primary for v), i.e., two prepares with the same view 
and sequence number and a different digest. But this is 
not possible because the replica is not faulty. Finally, our 
assumption about the strength of message digests ensures 
that the probability that m 4 m’ and D(m) = D(m’) is 
negligible. 

Replica? multicasts a (COMMIT, v, n, D(m),2),, to the 
other replicas when prepared(m, v, n, i) becomes true. 
This starts the commit phase. ‘Replicas accept commit 
messages and insert them in their log provided they are 
properly signed, the view number in the message is equal 
to the replica’s current view, and the sequence number is 
between h and H 


We define the committed and committed-local predi- 


cates as follows: @ommitted (myoym)lisitiruelitlandionly 
a commit matches a pre-prepare if they have the same 
view, sequence number, and digest. 


The e6mimit phase disies the following invariant: é 


anomie (posme. This invariant and 


the view-change protocol described in Section 4.4 ensure 


that non-faulty replicas agree on the sequence numbers ~ 


differentiviewslatleachireplica? Furthermore, it ensures 
that any request that commits locally at a non-faulty 


aowe Sequence mambe This ensures that all non- 


faulty replicas execute requests in the same order as 
required to provide the safety property. After executing 


the requested operation @@plicassendareply tothe|clienD 
@HeMMeStamplin thellastireplyatheysen to the client to 


guarantee exactly-once semantics. 

We do not rely on ordered message delivery, and 
therefore it is possible for a replica to commit requests 
@upofjorder This does not matter since it keeps the pre- 
prepare, prepare, and commit messages logged until the 
corresponding request can be executed. 

Figure 1 shows the operation of the algorithm in the 
normal case of no primary faults. Replica 0 is the primary, 
replica 3 is faulty, and Ç is the client. 


request Í pre-prepare! prepare ; commit reply 


Figure 1: Normal Case Operation 


4.3 Garbage Collection 


This section discusses the mechanism used to discard 
messages from the log. For the safety condition to hold, 


cn view changes. In ee 


by transferring all 
or a portion of the service state. Therefore, replicas also 


Generating these proofs after executing every opera- 
tion would be expensive. Instead, they are gemerated 


Gisiblesbysomerconstany(e.g., 100) Geena. We will 
refer to the states produced by the execution of these re- 
quests as@heckpointS and we will say thatjajcheckpointy 
with a proof is a stable checkpoint. 

Aveplicamaintains several logical copies of the service 
state: the last stable checkpoint, zero or more checkpoints 
@hatjaremortistableyandjayeurrentistate. Copy-on-write 
techniques can be used to reduce the space overhead 
to store the extra copies of the state, as discussed in 
Section 6.3. 

The proofjoficorrectnessjfor a checkpoint is generated 
as follows. WHénMalireplicaliprodiices\archeckpoint 


stich message): These 2 f + 1 messages are the proof of 
correctness for the checkpoint. 


A checkpoint with a proof becomes stable and the 


ín from its log; it also discards all earlier checkpoints and — 
‘checkpoint messages. 


Computing the proofs is efficient because the digest 
can be computed using incremental cryptography [1] as 
discussed in Section 6.3, and proofs are generated rarely. 

The checkpoint protocol is used to advance the low 
and high water marks (which limit what messages will 
be accepted). The 


Sabe. For example, if checkpoints are taken every 100 
requests, k might be 200. 


4.4 View Changes 


The view-change protocol providesiliveness by allowing 
the system to make progress when the primary fails. View 
changes are@riggered! by timeouts) that prevent backups 
from waiting indefinitely for requests to execute. A 
backup is waiting for a request if it received a valid request 


and has not executed it. A backup starts a timer when it 
receives a request and the timer is not already running. 
It stops the timer when it is no longer waiting to execute 
the request, but restarts it if at that point it is waiting to 
execute some other request. 


x = 
ie) 
; = | 
f$] 
=) 
a 


message (without the 


corresponding client message) 


sent), and O is a set of pre-prepare messages (without the 
piggybacked request). © is computed as follows: 


1. The primary determines the sequence number min-s 


2. The primary creates 


andma There are two cases: (D there is @HUGaSt 


message iwithisequencemumbernm, or (2) there 
is no such set. In the first case, the primary creates 
, where 
dis the request digest in the pre-prepare message for 
sequence number n with the highest view number 
in V. Intheysecondycase, it creates a new pre- 
prepare message 
where @2¥ is the digest of a special null request; 


Gegee buise (Paxos [18] 


used a similar technique to fill in gaps.) 


Next the primary appends the messages in © to its 
n EP EP 


insiso and discards information from the log 


‘contains are valid for view v + 1, and if the set O is 
OE it verifies the correctness of O by performing a 
computation similar to the one used by the primary to 
create O. Then 


Thereafter, the protocol proceeds as described in 
Section 4.2. Replicas redo the protocol for messages 
between min-s and max-s but they avoid re-executing 


client requests (by using their stored information about 
the last reply sent to each client). 


GET lipanrna yA ons ners omen eduesiamessecentiey 
Gra! stable\checkpoint(since these are not sent in new- 
view messages.) It can @btaiilimissin@unformationirony 

(@motherweplicay For example, replica 7 can obtain a 
missing checkpoint state s GPOmlGne/6Pihelreplicas 
whose checkpoint messages certified its correctness in 
W. Since f +1 of those replicas are correct,@eplicayiqwill 
always obtain s or a later certified stable checkpoint. We 
can avoid sending the entire checkpoint by @artitioningy 
the state and stamping each partition with the sequence 
number of the last request that modified it. To bring 
a replica up to date, i@iSionlyinecessaryitolsenduimthey 
Urey misnomer meatier rate 
checkpoint. — 


4.5 Correctness 


This section sketches the proof that the algorithm 
provides safety and liveness; details can be found in [4]. 


4.5.1 Safety 


As discussed earlier, the algorithm provides safety if all 
non-faulty replicas agree on the sequence numbers of 
requests that commit locally. 

In Section 4.2, we showed that if prepared(m, v, n, i) 
is true, prepared(m’,v,n, j) is false for any non-faulty 
replica j (including i = j) and any m’ such that 
D(m') #4 D(m). This implies that two non-faulty 
replicas agree on the sequence number of requests that 
commit locally in the same view at the two replicas. 

The view-change protocol ensures that non-faulty 
replicas also agree on the sequence number of requests 
that commit locally in different views at different replicas. 
A request m commits locally at a non-faulty replica with 
sequence number n in view v only if committed(m, v, n) 
is true. This means that there is a set R; containing at least 
f + 1 non-faulty replicas such that prepared(m, v, n, i) 
is true for every replica 2 in the set. 

Non-faulty replicas will not accept a pre-prepare for 
view v' > v without having received a new-view message 
for v’ (since only at that point do they enter the view). But 
any correct new-view message for view v’ > v contains 
correct view-change messages from every replica 2 in a 


set Ro of 2f + 1 replicas. Since there are 3 f + 1 replicas, 
R, and R must intersect in at least one replica k that is 
not faulty. k’s view-change message will ensure that the 
fact that m prepared in a previous view is propagated to 
subsequent views, unless the new-view message contains 
a view-change message with a stable checkpoint with a 
sequence number higher than n. In the first case, the 
algorithm redoes the three phases of the atomic multicast 
protocol for m with the same sequence number n and the 
new view number. This is important because it prevents 
any different request that was assigned the sequence 
number n in a previous view from ever committing. In 
the second case no replica in the new view will accept any 
message with sequence number lower than n. In either 
case, the replicas will agree on the request that commits 
locally with sequence number n. 


4.5.2 Liveness 


To provide liveness, replicas must move to a new view if 


they are unable to execute a request r 


@equiestedioperation\executes. We achieve these goals by 


three means. 


First, to avoid starting a view change too soon; a replica 


| 


Second, i 


this prevents it from starting the next view 


= 
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cannot cause a view change by sending a view-change 
message, because a view change will happen only if at 
least f + 1 replicas send view-change messages, but it 
can cause a view change when it is the primary (by not 
sending messages or sending bad messages). However, 
because the primary of view v is the replica p such that 
p = v mod |R], the primary cannot be faulty for more 
than f consecutive views. 

These three techniques guarantee liveness unless 
message delays grow faster than the timeout period 
indefinitely, which is unlikely in a real system. 


N 


4.6 Non-Determinism 


State machine replicas must be deterministic but many 
services involve some form of non-determinism. For 
example, the time-last-modified in NFS is set by reading 
the server’s local clock; if this were done independently 
at each replica, the states of non-faulty replicas would 
diverge. Therefore, some mechanism to ensure that all 
replicas select the same value is needed. In general, the 
client cannot select the value because it does not have 
enough information; for example, it does not know how 
its request will be ordered relative to concurrent requests 
by other clients. Instead, the primary needs to select the 
value either independently or based on values provided 
by the backups. 


Ghe request and value? This prevents a faulty primary from 


causing replica state to diverge by sending different val- 
ues to different replicas. However, a faulty primary might 
send the same, incorrect, value to all replicas. Therefore, 


This protocol is adequate for most services (including 


NFS) but occasionally replicas must participate in~ 


This can be accomplished by adding an extra phase to 
the protocol: the Primarylobtainstalithenticatedwvalues 


phase can be optimized away in the common case. For 
example, if replicas need a value that is “close enough” 
to that of their local clock, the extra phase can be avoided 
when their clocks are synchronized within some delta. 


5 Optimizations 


This section describes some optimizations that improve 
the performance of the algorithm during normal-case 
operation. All the optimizations preserve the liveness 
and safety properties. 


5.1 Reducing Communication 


We use three optimizations to reduce the cost of 
communication. The first avoids sending most large 


replies. A client request designates a replica to send 
tH@MGiBESHOL TEES The digests allow the client to 


check the correctness of the result while reducing network 


bandwidth consumption and CPU overhead significantly 


for large replies. (fithelclientdoesmonreceiveaicomect 
‘result from the designated replica, it retransmits the 
request as usual, requesting all replicas to send full 


replies. 
The second optimization reduces the number of 
message delays for an operation invocation from 5 


flaveleommited After executing the request, thef@pligas) 
sl The Aien wai for 


If it receives this 


many, haaa 


A request that has executed tentatively may abort if 
‘there is a view change and it is replaced by a null 
GAN In this case thePepheaeverteuitelstatento the 


last stable checkpoint in the new-view message or to its 
last checkpointed state (depending on which one has the 
higher sequence number). 
The third optimization improves the performance of 
read-only operations that do not modify the service 
state. 
‘replicas. Replicas execute the request immediately in 
their tentative state after checking that the request is 
properly authenticated, that the client has access, and 
that the request is in fact read-only. They send the reply 
only after all requests reflected in the tentative state have 
commited; this is necessary@@Olpreventithierchientiirom 
‘ observing uncommitted state. The client waits for 2 f +1 
@epliés from different replicas Withhesameresull, The 
client may be unable to collect 2 f +1 such replies if there 
‘are concurrent writes to data that affect the result; in this 
case, it retransmits the request as a regular read-write 


request after its retransmission timer expires. 


5.2 Cryptography 
In Section 4, we described an algorithm that uses 
digital signatures to authenticate all messages. However, 


change and new-view messages, which are sent rarely, 


and authenticate all other messages using message 
enico codes (MAGS). This eliminates the main 


performance bottleneck in previous systems [29, 22]. 
However, MAGs have a fundamental@limitauon rela- 
tive to digital signatures — the ifabilifyiteliprovelehap 
i i The algorithm 
in Section 4 and previous Byzantine-fault-tolerant algo- 
rithms [31, 16] for state machine replication rely on the 
extra power of digital signatures. We modified our algo- 
rithm to circumvent the problem by taking advantage of 


specific invariants, e.g, the invariant that no two different 
requests prepare with the same view and sequence num- 
ber at two non-faulty replicas. The modified algorithm is 
described in [5]. Here we sketch the main implications 
of using MACs. 

MACs can be computed three orders of magnitude 
faster than digital signatures. For example, a 200MHz 
Pentium Pro takes 43ms to generate a 1024-bit modulus 
RSA signature of an MD5 digest and 0.6ms to verify 
the signature [37], whereas it takes only 10.3ys to 
compute the MAC of a 64-byte message on the same 
hardware in our implementation. There are other public- 
key cryptosystems that generate signatures faster, e.g., 
elliptic curve public-key cryptosystems, but signature 
verification is slower [37] and in our algorithm each 
signature is verified many times. 


(Secretisession key witheach!eplical™ We compute 


message authentication codes by applying MDS to the 


Rather 
than using the 16 bytes of the final MD5 digest, welli8e 
GnlyitieMONeastsignificantibytes) This truncation has 
the obvious advantage of reducing the size of MACs and 
it also improves their resilience to certain attacks [27]. 
This is a variant of the secret suffix method [36], which 
is secure as long as MDS is collision resistant [27, 8]. 


; each entry is the 


MAC computed with the key shared by the sender and 
the replica corresponding to the entry. 
The time to verify an authenticator is constant but the 


time to generate one grows linearly with the number of h 


replicas. This is not a problem becauseWOlUOMOLEXpectD 


ow w arger 


and there is a huge Pey 


performance gap between MAC and digital signature 
computation. Furthermore, we compute authenticators 
efficiently; MD5 is applied to the message once and the 
resulting context is used to compute each vector entry 
by applying MD5 to the corresponding session key. For 
example, in a system with 37 replicas (i.e., a system 
that can tolerate 12 simultaneous faults) an authenticator 
can still be computed much more than two orders of 
magnitude faster than a 1024-bit modulus RSA signature. 
The size of authenticators grows linearly with the 
number of replicas but it grows slowly: it is equal to 
30 x [3+] bytes. An authenticator is smaller than an 
RSA signature with a 1024-bit modulus for n < 13 (i.e., 
systems that can tolerate up to 4 simultaneous faults), 
which we expect to be true in most configurations. 


6 Implementation 


This section describes our implementation. First we 
discuss the replication library, which can be used as 
a basis for any replicated service. In Section 6.2 we 
describe how we implemented a replicated NFS on top 
of the replication library. Then we describe how we 
maintain checkpoints and compute checkpoint digests 
efficiently. 


6.1 The Replication Library 

The client interface to the replication library consists of 
a single procedure, invoke, with one argument, an input 
buffer containing a request to invoke a state machine 
operation. The invoke procedure uses our protocol to 
execute the requested operation at the replicas and select 
the correct reply from among the replies of the individual 
replicas. It returns a pointer to a buffer containing the 
operation result. 

On the Gervergside®, the replication code makes a 
number of upcalls to procedures that the server part of 
the application must implement. There are procedures 
to execute requests (execute), to maintain checkpoints of 

` the service state (make_checkpoint, delete_checkpoint), to 
-obtain the digest of a specified checkpoint (get_digest), 
and to obtain missing information (get_checkpoint, 


“operation, and places the result in an output buffer. The 


other procedures are discussed further in Sections 6.3 
and 6.4. 

Point-to-point communication between nodes is imple- 
mented using UDP, and multicast to the group of replicas 
is implemented using UDP over IP multicast [7]. There 
is asingle IP multicast group for each service, which con- 
tains all the replicas. These communication protocols are 
unreliable; they may duplicate or lose messages or deliver 
them out of order. 


The algorithm tolerates out-of-order delivery and — 


can they primary retransmits pre-prepare messages after a long 


deadlock ee 


waiting es. During view changes, replicas retransmit 

fora siew-changemessages uni the receivea matchingnew- 

new-view 

on which The replication library does not implement view 

everybody changes or retransmissions at present. This does 

agrees? not compromise the accuracy of the results given 
j in Section 7 because the rest of the algorithm is 


completely implemented (including the manipulation of 
the timers that trigger view changes) and because we 
have formalized the complete algorithm and proved its 
correctness [4]. 


6.2 BFS: A Byzantine-Fault-tolerant File System 


We implemented BFS, a Byzantine-fault-tolerant NFS 
service, using the replication library. Figure 2 shows the 
architecture of BFS. We opted not to modify the kernel 
NFS client and server because we did not have the sources 
for the Digital Unix kernel. 

A file system exported by the fault-tolerant NFS service 
is mounted on the client machine like any regular NFS 
file system. Application processes run unmodified and 
interact with the mounted file system through the NFS 
client in the kernel. We rely on user level relay processes 
to mediate communication between the standard NFS 
client and the replicas. A relay receives NFS protocol 
requests, calls the invoke procedure of our replication 
library, and sends the result back to the NFS client. 


replica 0 


replication 
library 


kernel VM 
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Andrew ESEA 
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Figure 2: Replicated File System Architecture. 


Each replica runs a user-level process with the 
replication library and our NFS V2 daemon, which we 
will refer to as snfsd (for simple nfsd). The replication 
library receives requests from the relay, interacts with 
snfsd by making upcalls, and packages NFS replies into 
replication protocol replies that it sends to the relay. 

We implemented snfsd using a fixed-size memory- 
mapped file. All the file system data structures, e.g., 
inodes, blocks and their free lists, are in the mapped file. 
We rely on the operating system to manage the cache of 
memory-mapped file pages and to write modified pages 
to disk asynchronously. The current implementation 
uses 8KB blocks and inodes contain the NFS status 
information plus 256 bytes of data, which is used to store 
directory entries in directories, pointers to blocks in files, 
and text in symbolic links. Directories and files may also 
use indirect blocks in a way similar to Unix. 


Our implementation ensures that istae machino 


) 


<- The primary 
proposes the values for time-last-modified and time- 
last-accessed, and replicas select the larger of the 
proposed value and one greater than the maximum of all 
values selected for earlier requests. We do not require 
synchronous writes to implement NFS V2 protocol 
semantics because BFS achieves stability of modified 
data and meta-data through replication [20]. 


6.3 Maintaining Checkpoints 


This section describes how snfsd maintains checkpoints 
of the file system state. Recall that each replica maintains 
several logical copies of the state: the current state, some 
number of checkpoints that are not yet stable, and the last 
stable checkpoint. 

snfsd executes file system operations directly in the 
memory mapped file to preserve locality, and it uses copy- 
on-write to reduce the space and time overhead associated 
with maintaining checkpoints. snfsd maintains a copy- 
on-write bit for every 512-byte block in the memory 
mapped file. When the replication code invokes the 
make_checkpoint upcall, snfsd sets all the copy-on-write 
bits and creates a (volatile) checkpoint record, containing 
the current sequence number, which it receives as an 
argument to the upcall, and a list of blocks. This list 
contains the copies of the blocks that were modified 
since the checkpoint was taken, and therefore, it is 
initially empty. The record also contains the digest of 
the current state; we discuss how the digest is computed 
in Section 6.4. 

When a block of the memory mapped file is modified 
while executing a client request, snfsd checks the copy- 
on-write bit for the block and, if it is set, stores the block’s 
current contents and its identifier in the checkpoint record 
for the last checkpoint. Then, it overwrites the block 
with its new value and resets its copy-on-write bit. 
snfsd retains a checkpoint record until told to discard 
it via a delete_checkpoint upcall, which is made by the 
replication code when a later checkpoint becomes stable. 

If the replication code requires a checkpoint to send 
to another replica, it calls the get_checkpoint upcall. To 
obtain the value for a block, snfsd first searches for the 
block in the checkpoint record of the stable checkpoint, 
and then searches the checkpoint records of any later 
checkpoints. If the block is not in any checkpoint record, 
it returns the value from the current state. 

The use of the copy-on-write technique and the fact 
that we keep at most 2 checkpoints ensure that the space 
and time overheads of keeping several logical copies 
of the state are low. For example, in the Andrew 
benchmark experiments described in Section 7, the 
average checkpoint record size is only 182 blocks with a 
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maximum of 500. 


6.4 Computing Checkpoint Digests 


snfsd computes a digest of a checkpoint state as part 
of a make_checkpoint upcall. Although checkpoints 
are only taken occasionally, it is important to compute 
the state digest incrementally because the state may be 
large. snfsd uses an incremental collision-resistant one- 
way hash function called AdHash [1]. This function 
divides the state into fixed-size blocks and uses some 
other hash function (e.g., MD5) to compute the digest 
of the string obtained by concatenating the block index 
with the block value for each block. The digest of the 
state is the sum of the digests of the blocks modulo some 
large integer. In our current implementation, we use the 
512-byte blocks from the copy-on-write technique and 
compute their digest using MD5. 

To compute the digest for the state incrementally, snfsd 
maintains a table with a hash value for each 512-byte 
block. This hash value is obtained by applying MD5 
to the block index concatenated with the block value at 
the time of the last checkpoint. When make_checkpoint 
is called, snfsd obtains the digest d for the previous 
checkpoint state (from the associated checkpoint record). 
It computes new hash values for each block whose copy- 
on-write bit is reset by applying MD5 to the block index 
concatenated with the current block value. Then, it adds 
the new hash value to d, subtracts the old hash value 
from d, and updates the table to contain the new hash 
value. This process is efficient provided the number of 
modified blocks is small; as mentioned above, on average 
182 blocks are modified per checkpoint for the Andrew 
benchmark. 


7 Performance Evaluation 


This section evaluates the performance of our system 
using two benchmarks: a micro-benchmark and the 
Andrew benchmark [15]. The micro-benchmark provides 
a service-independent evaluation of the performance of 
the replication library; it measures the latency to invoke 
a null operation, i.e., an operation that does nothing. 

The Andrew benchmark is used to compare BFS with 
two other file systems: one is the NFS V2 implementation 
in Digital Unix, and the other is identical to BFS except 
without replication. The first comparison demonstrates 
that our system is practical by showing that its latency is 
similar to the latency of a commercial system that is used 
daily by many users. The second comparison allows us to 
evaluate the overhead of our algorithm accurately within 
an implementation of a real service. 


7.1 Experimental Setup 


The experiments measure normal-case behavior (i.e., 
there are no view changes), because this is the behavior 


that determines the performance of the system. All 
experiments ran with one client running two relay 
processes, and four replicas. Four replicas can tolerate 
one Byzantine fault; we expect this reliability level to 
suffice for most applications. The replicas and the 
client ran on identical DEC 3000/400 Alpha workstations. 
These workstations have a 133 MHz Alpha 21064 
processor, 128 MB of memory, and run Digital Unix 
version 4.0. The file system was stored by each replica 
on a DEC RZ26 disk. All the workstations were 
connected by a 10Mbit/s switched Ethernet and had DEC 
LANCE Ethernet interfaces. The switch was a DEC 
EtherWORKS 8T/TX. The experiments were run on an 
isolated network. 

The interval between checkpoints was 128 requests, 
which causes garbage collection to occur several times in 
any of the experiments. The maximum sequence number 
accepted by replicas in pre-prepare messages was 256 
plus the sequence number of the last stable checkpoint. 


7.2 Micro-Benchmark 


The micro-benchmark measures the latency to invoke 
a null operation. It evaluates the performance of two 
implementations of a simple service with no state that 
implements null operations with arguments and results 
of different sizes. The first implementation is replicated 
using our library and the second is unreplicated and 
uses UDP directly. Table 1 reports the response times 
measured at the client for both read-only and read- 
write operations. They were obtained by timing 10,000 
operation invocations in three separate runs and we report 
the median value of the three runs. The maximum 
deviation from the median was always below 0.3% of 
the reported value. We denote each operation by a/b, 
where a and b are the sizes of the operation argument and 
result in KBytes. 


arg./res. replicated without 
(KB) read-write read-only | replication 
0/0 3.35 (309%) | 1.62 (98%) 0.82 
4/0 14.19 (207%) | 6.98 (51%) 4.62 
0/4 8.01 (72%) 5.94 (27%) 4.66 


Table 1: Micro-benchmark results (in milliseconds); the 
percentage overhead is relative to the unreplicated case. 


The overhead introduced by the replication library is 
due to extra computation and communication. For exam- 
ple, the computation overhead for the read-write 0/0 op- 
eration is approximately 1.06ms, which includes 0.55ms 
spent executing cryptographic operations. The remain- 
ing 1.47ms of overhead are due to extra communication; 
the replication library introduces an extra message round- 
trip, it sends larger messages, and it increases the number 
of messages received by each node relative to the service 
without replication. 
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The overhead for read-only operations is significantly 
lower because the optimization discussed in Section 5.1 
reduces both computation and communication overheads. 
For example, the computation overhead for the read-only 
0/0 operation is approximately 0.43ms, which includes 
0.23ms spent executing cryptographic operations, and 
the communication overhead is only 0.37ms because the 
protocol to execute read-only operations uses a single 
round-trip. 

Table 1 shows that the relative overhead is lower for 
the 4/0 and 0/4 operations. This is because a significant 
fraction of the overhead introduced by the replication 
library is independent of the size of operation arguments 
and results. For example, in the read-write 0/4 operation, 
the large message (the reply) goes over the network 
only once (as discussed in Section 5.1) and only the 
cryptographic overhead to process the reply message is 
increased. The overhead is higher for the read-write 4/0 
operation because the large message (the request) goes 
over the network twice and increases the cryptographic 
overhead for processing both request and pre-prepare 
messages. 

It is important to note that this micro-benchmark 
represents the worst case overhead for our algorithm 
because the operations perform no work and the 
unreplicated server provides very weak guarantees. 
Most services will require stronger guarantees, e.g., 
authenticated connections, and the overhead introduced 
by our algorithm relative to a server that implements these 
guarantees will be lower. For example, the overhead 
of the replication library relative to a version of the 
unreplicated service that uses MACs for authentication 
is only 243% for the read-write 0/0 operation and 4% for 
the read-only 4/0 operation. 

We can estimate a rough lower bound on the 
performance gain afforded by our algorithm relative to 
Rampart [30]. Reiter reports that Rampart has a latency 
of 45ms for a multi-RPC of a null message in a 10 Mbit/s 
Ethernet network of 4 SparcStation 10s [30]. The multi- 
RPC is sufficient for the primary to invoke a state machine 
operation but for an arbitrary client to invoke an operation 
it would be necessary to add an extra message delay and 
an extra RSA signature and verification to authenticate 
the client; this would lead to a latency of at least 65ms 
(using the RSA timings reported in [29].) Even if we 
divide this latency by 1.7, the ratio of the SPECint92 
ratings of the DEC 3000/400 and the SparcStation 10, our 
algorithm still reduces the latency to invoke the read-write 
and read-only 0/0 operations by factors of more than 10 
and 20, respectively. Note that this scaling is conservative 
because the network accounts for a significant fraction 
of Rampart’s latency [29] and Rampart’s results were 
obtained using 300-bit modulus RSA signatures, which 
are not considered secure today unless the keys used to 


generate them are refreshed very frequently. 

There are no published performance numbers for 
SecureRing [16] but it would be slower than Rampart 
because its algorithm has more message delays and 
signature operations in the critical path. 


7.3 Andrew Benchmark 


The Andrew benchmark [15] emulates a software 
development workload. It has five phases: (1) creates 
subdirectories recursively; (2) copies a source tree; (3) 
examines the status of all the files in the tree without 
examining their data; (4) examines every byte of data in 
all the files; and (5) compiles and links the files. 

We use the Andrew benchmark to compare BFS with 
two other file system configurations: NFS-std, which is 
the NFS V2 implementation in Digital Unix, and BFS-nr, 
which is identical to BFS but with no replication. BFS-nr 
ran two simple UDP relays on the client, and on the server 
it ran a thin veneer linked with a version of snfsd from 
which all the checkpoint management code was removed. 
This configuration does not write modified file system 
state to disk before replying to the client. Therefore, it 
does not implement NFS V2 protocol semantics, whereas 
both BFS and NFS-std do. 

Out of the 18 operations in the NFS V2 protocol only 
getattr is read-only because the time-last-accessed 
attribute of files and directories is set by operations 
that would otherwise be read-only, e.g., read and 
Lookup. The result is that our optimization for read- 
only operations can rarely be used. To show the impact 
of this optimization, we also ran the Andrew benchmark 
on a second version of BFS that modifies the Lookup 
operation to be read-only. This modification violates 
strict Unix file system semantics but is unlikely to have 
adverse effects in practice. 

For all configurations, the actual benchmark code ran 
at the client workstation using the standard NFS client 
implementation in the Digital Unix kernel with the same 
mount options. The most relevant of these options for 
the benchmark are: UDP transport, 4096-byte read and 
write buffers, allowing asynchronous client writes, and 
allowing attribute caching. 

We report the mean of 10 runs of the benchmark for 
each configuration. The sample standard deviation for 
the total time to run the benchmark was always below 
2.6% of the reported value but it was as high as 14% for 
the individual times of the first four phases. This high 
variance was also present in the NFS-std configuration. 
The estimated error for the reported mean was below 
4.5% for the individual phases and 0.8% for the total. 

Table 2 shows the results for BFS and BFS-nr. The 
comparison between BFS-strict and BFS-nr shows that 
the overhead of Byzantine fault tolerance for this service 
is low — BFS-strict takes only 26% more time to run 


12 


BFS 
phase strict r/o lookup BFS-nr 
1 0.55 (57%) | 0.47 (84%) 0.35 
2 9.24 (82%) 7.91 (56%) 5.08 
3 7.24 (18%) 6.45 (6%) 6.11 
4 8.77 (18%) 7.87 (6%) 7.41 
5 38.68 (20%) | 38.38 (19%) | 32.12 
total | 64.48 (26%) | 61.07 (20%) | 51.07 


Table 2: Andrew benchmark: BFS vs BFS-nr. The times 
are in seconds. 


the complete benchmark. The overhead is lower than 
what was observed for the micro-benchmarks because 
the client spends a significant fraction of the elapsed time 
computing between operations, i.e., between receiving 
the reply to an operation and issuing the next request, 
and operations at the server perform some computation. 
But the overhead is not uniform across the benchmark 
phases. The main reason for this is a variation in the 
amount of time the client spends computing between 
operations; the first two phases have a higher relative 
overhead because the client spends approximately 40% 
of the total time computing between operations, whereas 
it spends approximately 70% during the last three phases. 

The table shows that applying the read-only optimiza- 
tion to Lookup improves the performance of BFS sig- 
nificantly and reduces the overhead relative to BFS-nr 
to 20%. This optimization has a significant impact in 
the first four phases because the time spent waiting for 
lookup operations to complete in BFS-strict is at least 
20% of the elapsed time for these phases, whereas it is 
less than 5% of the elapsed time for the last phase. 


BFS 
phase strict r/o lookup | NFS-std 
1 0.55 (-69%) | 0.47 (-73%) 1.75 
2 9.24 (-2%) | 7.91 (-16%) 9.46 
3 7.24 (35%) | 6.45 (20%) 5.36 
4 8.77 (32%) | 7.87 (19%) 6.60 
5 38.68 (-2%) | 38.38 (-2%) 39.35 
total | 64.48 (3%) | 61.07 (-2%) 62.52 


Table 3: Andrew benchmark: BFS vs NFS-std. The 
times are in seconds. 


Table 3 shows the results for BFS vs NFS-std. These 
results show that BFS can be used in practice — BFS- 
strict takes only 3% more time to run the complete 
benchmark. Thus, one could replace the NFS V2 
implementation in Digital Unix, which is used daily 
by many users, by BFS without affecting the latency 
perceived by those users. Furthermore, BFS with the 
read-only optimization for the lookup operation is 
actually 2% faster than NFS-std. 

The overhead of BFS relative to NFS-std is not the 


same for all phases. Both versions of BFS are faster 
than NFS-std for phases 1, 2, and 5 but slower for the 
other phases. This is because during phases 1, 2, and 5a 
large fraction (between 21% and 40%) of the operations 
issued by the client are synchronous, i.e., operations that 
require the NFS implementation to ensure stability of 
modified file system state before replying to the client. 
NFS-std achieves stability by writing modified state to 
disk whereas BFS achieves stability with lower latency 
using replication (as in Harp [20]). NFS-std is faster than 
BFS (and BFS-nr) in phases 3 and 4 because the client 
issues no synchronous operations during these phases. 


8 Related Work 


Most previous work on replication techniques ignored 
Byzantine faults or assumed a synchronous system 
model (e.g., [17, 26, 18, 34, 6, 10]). Viewstamped 
replication [26] and Paxos [18] use views with a primary 
and backups to tolerate benign faults in an asynchronous 
system. Tolerating Byzantine faults requires amuch more 
complex protocol with cryptographic authentication, an 
extra pre-prepare phase, and a different technique to 
trigger view changes and select primaries. Furthermore, 
our system uses view changes only to select anew primary 
but never to select a different set of replicas to form the 
new view as in [26, 18]. 

Some agreement and consensus algorithms tolerate 
Byzantine faults in asynchronous systems (e.g,[2, 3, 24]). 
However, they do not provide a complete solution for 
state machine replication, and furthermore, most of them 
were designed to demonstrate theoretical feasibility and 
are too slow to be used in practice. Our algorithm 
during normal-case operation is similar to the Byzantine 
agreement algorithm in [2] but that algorithm is unable 
to survive primary failures. 

The two systems that are most closely related to our 
work are Rampart [29, 30, 31, 22] and SecureRing [16]. 
They implement state machine replication but are more 
than an order of magnitude slower than our system and, 
most importantly, they rely on synchrony assumptions. 

Both Rampart and SecureRing must exclude faulty 
replicas from the group to make progress (e.g., to remove 
a faulty primary and elect a new one), and to perform 
garbage collection. They rely on failure detectors 
to determine which replicas are faulty. | However, 
failure detectors cannot be accurate in an asynchronous 
system [21], i.e., they may misclassify a replica as faulty. 
Since correctness requires that fewer than 1/3 of group 
members be faulty, a misclassification can compromise 
correctness by removing a non-faulty replica from the 
group. This opens an avenue of attack: an attacker 
gains control over a single replica but does not change 
its behavior in any detectable way; then it slows correct 
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replicas or the communication between them until enough 
are excluded from the group. 

To reduce the probability of misclassification, failure 
detectors can be calibrated to delay classifying a replica 
as faulty. However, for the probability to be negligible 
the delay must be very large, which is undesirable. For 
example, if the primary has actually failed, the group will 
be unable to process client requests until the delay has 
expired. Our algorithm is not vulnerable to this problem 
because it never needs to exclude replicas from the group. 

Phalanx [23, 25] applies quorum replication tech- 
niques [12] to achieve Byzantine fault-tolerance in asyn- 
chronous systems. This work does not provide generic 
state machine replication; instead, it offers a data reposi- 
tory with operations to read and write individual variables 
and to acquire locks. The semantics it provides for read 
and write operations are weaker than those offered by our 
algorithm; we can implement arbitrary operations that ac- 
cess any number of variables, whereas in Phalanx it would 
be necessary to acquire and release locks to execute such 
operations. There are no published performance num- 
bers for Phalanx but we believe our algorithm is faster 
because it has fewer message delays in the critical path 
and because of our use of MACs rather than public key 
cryptography. The approach in Phalanx offers the poten- 
tial for improved scalability; each operation is processed 
by only a subset of replicas. But this approach to scala- 
bility is expensive: it requires n > 4f + 1 to tolerate f 
faults; each replica needs a copy of the state; and the load 
on each replica decreases slowly with n (it is O(1/./n)). 


9 Conclusions 


This paper has described a new state-machine replication 
algorithm that is able to tolerate Byzantine faults and can 
be used in practice: it is the first to work correctly in 
an asynchronous system like the Internet and it improves 
the performance of previous algorithms by more than an 
order of magnitude. 

The paper also described BFS, a Byzantine-fault- 
tolerant implementation of NFS. BFS demonstrates that 
it is possible to use our algorithm to implement real 
services with performance close to that of an unreplicated 
service — the performance of BFS is only 3% worse than 
that of the standard NFS implementation in Digital Unix. 
This good performance is due to a number of important 
optimizations, including replacing public-key signatures 
by vectors of message authentication codes, reducing 
the size and number of messages, and the incremental 
checkpoint-management techniques. 

One reason why Byzantine-fault-tolerant algorithms 
will be important in the future is that they can allow 
systems to continue to work correctly even when there 
are software errors. Not all errors are survivable; 
our approach cannot mask a software error that occurs 


at all replicas. However, it can mask errors that 
occur independently at different replicas, including 
nondeterministic software errors, which are the most 
problematic and persistent errors since they are the 
hardest to detect. In fact, we encountered such a software 
bug while running our system, and our algorithm was able 
to continue running correctly in spite of it. 

There is still much work to do on improving our system. 
One problem of special interest is reducing the amount 
of resources required to implement our algorithm. The 
number of replicas can be reduced by using f replicas 
as witnesses that are involved in the protocol only when 
some full replica fails. We also believe that it is possible 
to reduce the number of copies of the state to f + 1 but 
the details remain to be worked out. 
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